Fast Fourier Transforms on Distributed Memory Parallel Machines
One issue which is central in developing a general purpose subroutine on a distributed memory parallel machine is the data distribution. It is possible that users would like to use the subroutine with different data distributions. Thus there is a need to design algorithms on distributed memory parallel machines which can support a variety of data distributions. In this dissertation we have addressed the problem of developing such algorithms to compute the Discrete Fourier Transform (DFT) of real and complex data. The implementations given in this dissertation work for a class of data distributions commonly encountered in scientific applications, known as the block scattered data distributions. The implementations are targeted at distributed memory parallel machines. We have also addressed the problem of rearranging the data after computing the FFT. For computing the DFT of complex data, we use a standard radix-2 FFT algorithm which has been studied extensively in parallel environments. There are two ways of computing the DFT of real data that are known to be efficient in serial environments: (i) the real fast Fourier transform (RFFT) algorithm, and (ii) the fast Hartley transform (FHT) algorithm. However, in distributed memory environments both have excessive communication overhead. We restructure the RFFT and FHT algorithms to reduce this overhead. The restructured RFFT and FHT algorithms are then used in the generalized implementations which work for block scattered data distributions. Experimental results are given for the restructured RFFT and FHT algorithms on two parallel machines: the NCUBE-7, a hypercube MIMD machine, and the AMT DAP-510, a mesh SIMD machine. The performance of the FFT, RFFT and FHT algorithms with the block scattered data distribution was evaluated on the Intel iPSC/860, a hypercube MIMD machine.
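To make the block scattered (block-cyclic) distribution mentioned above concrete, the sketch below shows how a one-dimensional array is dealt out in fixed-size blocks to processes in round-robin order. This is an illustrative Python sketch, not code from the dissertation; the function name and the parameters B (block size) and P (process count) are ours.

```python
# Illustrative sketch of a 1-D block scattered (block-cyclic) distribution:
# global index g lives in block g // B, and blocks are dealt to P processes
# round-robin.

def owner_and_local_index(g, B, P):
    """Return (process rank, local index) for global index g."""
    block = g // B                  # which block the element falls in
    rank = block % P                # blocks are assigned cyclically to ranks
    local_block = block // P        # blocks this rank already holds before it
    return rank, local_block * B + (g % B)

if __name__ == "__main__":
    N, B, P = 16, 2, 4
    for g in range(N):
        print(g, owner_and_local_index(g, B, P))
```

With B equal to N/P the mapping reduces to a pure block distribution, and with B = 1 it reduces to a cyclic distribution, which is why a subroutine supporting block scattered layouts covers both common cases.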
Extensible Component Based Architecture for FLASH, A Massively Parallel, Multiphysics Simulation Code
FLASH is a publicly available high performance application code which has
evolved into a modular, extensible software system from a collection of
unconnected legacy codes. FLASH has been successful because its capabilities
have been driven by the needs of scientific applications, without compromising
maintainability, performance, and usability. In its newest incarnation, FLASH3
consists of inter-operable modules that can be combined to generate different
applications. The FLASH architecture allows arbitrarily many alternative
implementations of its components to co-exist and interchange with each other,
resulting in greater flexibility. Further, a simple and elegant mechanism
exists for customization of code functionality without the need to modify the
core implementation of the source. A built-in unit test framework providing
verifiability, combined with a rigorous software maintenance process, allows the
code to operate simultaneously in the dual mode of production and development.
In this paper we describe the FLASH3 architecture, with emphasis on solutions
to the more challenging conflicts arising from solver complexity, portable
performance requirements, and legacy codes. We also include results from user
surveys conducted in 2005 and 2007, which highlight the success of the code.
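To illustrate the idea of interchangeable component implementations described above, here is a minimal Python sketch. It is not FLASH's actual setup machinery; the unit names (GridUnit, UniformGrid, AMRGrid) and the assemble function are hypothetical stand-ins for the general pattern.

```python
# Minimal sketch of a component ("unit") with several interchangeable
# implementations, where an application is assembled by choosing one
# implementation per unit.

class GridUnit:
    def refine(self):
        raise NotImplementedError

class UniformGrid(GridUnit):
    def refine(self):
        print("uniform grid: no refinement")

class AMRGrid(GridUnit):
    def refine(self):
        print("adaptive mesh refinement step")

IMPLEMENTATIONS = {"Grid": {"uniform": UniformGrid, "amr": AMRGrid}}

def assemble(choices):
    """Build an application from a {unit: implementation_name} mapping."""
    return {unit: IMPLEMENTATIONS[unit][name]() for unit, name in choices.items()}

app = assemble({"Grid": "amr"})
app["Grid"].refine()
```

The point of the pattern is that adding a new implementation only requires registering it against the unit's interface; existing applications keep working, while new applications can select it at assembly time.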
Fourth Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE4)
This report records and discusses the Fourth Workshop on Sustainable Software
for Science: Practice and Experiences (WSSSPE4). The report includes a
description of the keynote presentation of the workshop, the mission and vision
statements that were drafted at the workshop and finalized shortly after it, a
set of idea papers, position papers, experience papers, demos, and lightning
talks, and a panel discussion. The main part of the report covers the set of
working groups that formed during the meeting, and for each, discusses the
participants, the objective and goal, and how the objective can be reached,
along with contact information for readers who may want to join the group.
Finally, we present results from a survey of the workshop attendees.
Star Formation in the First Galaxies I: Collapse Delayed by Lyman-Werner Radiation
We investigate the process of metal-free star formation in the first galaxies
with a high-resolution cosmological simulation. We consider the cosmologically
motivated scenario in which a strong molecule-destroying Lyman-Werner (LW)
background inhibits effective cooling in low-mass haloes, delaying star
formation until the collapse of more massive haloes. Only when molecular
hydrogen (H2) can self-shield from LW radiation, which requires a halo capable
of cooling by atomic line emission, will star formation be possible. To follow
the formation of multiple gravitationally bound objects, at high gas densities
we introduce sink particles which accrete gas directly from the computational
grid. We find that in a 1 Mpc^3 (comoving) box, runaway collapse first occurs
in a 3x10^7 M_sun dark matter halo at z~12 assuming a background intensity of
J21=100. Due to a runaway increase in the H2 abundance and cooling rate, a
self-shielding, supersonically turbulent core develops abruptly with ~10^4
M_sun in cold gas available for star formation. We analyze the formation of
this self-shielding core, the character of turbulence, and the prospects for
star formation. Due to a lack of fragmentation on scales we resolve, we argue
that LW-delayed metal-free star formation in atomic cooling haloes is very
similar to star formation in primordial minihaloes, although in making this
conclusion we ignore internal stellar feedback. Finally, we briefly discuss the
detectability of metal-free stellar clusters with the James Webb Space
Telescope.
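As an illustration of the sink-particle technique mentioned in the abstract, the sketch below shows one way grid cells near a sink could hand excess mass to it. This is not the paper's implementation; every name, parameter, and threshold here is hypothetical.

```python
# Illustrative sketch of sink-particle accretion: cells within the sink's
# accretion radius that exceed a density threshold transfer their excess
# mass to the sink.

import numpy as np

def accrete(sink_pos, sink_mass, cell_pos, cell_rho, cell_vol, r_acc, rho_thresh):
    """Return updated sink mass and cell densities after one accretion sweep."""
    d = np.linalg.norm(cell_pos - sink_pos, axis=1)      # distance of each cell
    take = (d < r_acc) & (cell_rho > rho_thresh)          # cells eligible to accrete from
    excess = (cell_rho[take] - rho_thresh) * cell_vol[take]  # mass removed per cell
    cell_rho = cell_rho.copy()
    cell_rho[take] = rho_thresh                            # cap density at the threshold
    return sink_mass + excess.sum(), cell_rho

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.random((100, 3))
    rho = rng.random(100) * 1e4
    vol = np.full(100, 1e-3)
    m, _ = accrete(np.array([0.5, 0.5, 0.5]), 0.0, pos, rho, vol,
                   r_acc=0.2, rho_thresh=5e3)
    print("accreted mass:", m)
```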
Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team
The Performance Engineering Institute (PERI) originally proposed a tiger team activity as a mechanism to target significant effort at optimizing key Office of Science applications, a model that was successfully realized with the assistance of two JOULE metric teams. However, the Office of Science requested a new focus beginning in 2008: assistance in forming its ten-year facilities plan. To meet this request, PERI formed the Architecture Tiger Team, which is modeling the performance of key science applications on future architectures, with S3D, FLASH and GTC chosen as the first application targets. In this activity, we have measured the performance of these applications on current systems in order to understand their baseline performance and to ensure that our modeling activity focuses on the right versions and inputs of the applications. We have applied a variety of modeling techniques to anticipate the performance of these applications on a range of expected future systems. While our initial findings predict that Office of Science applications will continue to perform well on future machines from major hardware vendors, we have also encountered several areas in which we must extend our modeling techniques in order to fulfill our mission accurately and completely. In addition, we anticipate that models of a wider range of applications will reveal critical differences between expected future systems, thus providing guidance for future Office of Science procurement decisions, and will enable DOE applications to fully exploit machines in future facilities.
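As a rough illustration of the kind of analytic performance modeling described above (not PERI's actual models), the toy sketch below predicts runtime as parallel compute time plus a simple latency/bandwidth communication term; all machine parameters and numbers are invented for illustration only.

```python
# Toy analytic performance model: serial work divided across p cores plus a
# latency/bandwidth term for per-step communication. Parameters are invented.

def predicted_runtime(work_flops, flops_per_sec, p,
                      msgs_per_step, bytes_per_step,
                      latency_s, bandwidth_bps):
    compute = work_flops / (p * flops_per_sec)                    # ideal parallel compute time
    comm = msgs_per_step * latency_s + bytes_per_step / bandwidth_bps
    return compute + comm

# Compare two hypothetical machines at 4096 cores.
for name, (flops, lat, bw) in {"machine A": (1e10, 5e-6, 1e9),
                               "machine B": (2e10, 2e-6, 5e9)}.items():
    print(name, predicted_runtime(1e15, flops, 4096, 100, 1e8, lat, bw))
```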
Programming Abstractions for Data Locality
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component; we are rapidly moving to an era in which computing is cheap and massively parallel, while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on hardware cache coherence to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant from each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow developers to describe how to decompose and how to lay out data in memory. Fortunately, there are many emerging concepts, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy so that the best of these concepts can be combined into a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, with candidate techniques ranging from template libraries all the way to completely new languages.
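Among the locality-preserving abstractions the report lists, tiling is the simplest to show in a few lines. The sketch below is a generic illustration in Python, not drawn from any particular exascale programming model; the function name and tile size are ours.

```python
# Illustrative sketch of tiling: restructure a 2-D traversal into blocks so
# each tile of data is reused before the computation moves on, instead of
# streaming row by row through the whole array.

import numpy as np

def tiled_indices(nx, ny, tile):
    """Yield (i, j) index pairs tile by tile instead of row by row."""
    for ti in range(0, nx, tile):
        for tj in range(0, ny, tile):
            for i in range(ti, min(ti + tile, nx)):
                for j in range(tj, min(tj + tile, ny)):
                    yield i, j

# Usage: apply a simple elementwise update over a tiled traversal.
a = np.arange(64.0).reshape(8, 8)
b = np.zeros_like(a)
for i, j in tiled_indices(8, 8, tile=4):
    b[i, j] = 2.0 * a[i, j]
print(b[:2])
```

The same idea generalizes to the other constructs the report names: data layouts and array views describe where the tiles live in memory, while affinity and topology-aware communication describe where they live across the machine.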